
DAOS-16979 control: Reduce frequency of hugepage allocation at runtime #15848

Open

tanabarr wants to merge 4 commits into master
Conversation

@tanabarr (Contributor) commented Feb 5, 2025

Reduce the frequency of hugepage allocation change requests made to
the kernel during daos_server start-up. Check the total number of hugepages on
start and only request more from the kernel if the recommended number,
calculated from the server config file content, is greater than the existing
system total.

On the first start of the server process after a reboot, allocate an
arbitrarily large number, e.g. enough for 16*2 engine targets, the ambition
being to allocate once and reduce the chance of fragmentation.
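
A minimal sketch of that start-up decision, for illustration only (this is not the PR's code; `pagesPerTarget` and the other names are hypothetical, and the per-target page count is an assumed value):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// hugepagesTotal reads HugePages_Total from /proc/meminfo, the kernel's
// current system-wide hugepage count.
func hugepagesTotal() (int, error) {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) >= 2 && fields[0] == "HugePages_Total:" {
			return strconv.Atoi(fields[1])
		}
	}
	return 0, fmt.Errorf("HugePages_Total not found in /proc/meminfo")
}

func main() {
	// Hypothetical sizing: assume a fixed page budget per engine target
	// and the "16*2 engine targets" first-start allowance described above.
	const pagesPerTarget = 512
	const largeTargetCount = 16 * 2
	recommended := largeTargetCount * pagesPerTarget

	total, err := hugepagesTotal()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if total >= recommended {
		// Existing system total already covers the recommendation:
		// skip the kernel request entirely, reducing fragmentation risk.
		fmt.Println("skip hugepage allocation, no change required")
		return
	}
	fmt.Printf("requesting %d hugepages (system currently has %d)\n",
		recommended, total)
}
```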

Test-tag: pr daily_regression
Allow-unstable-test: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that the user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is the master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

Test-tag-hw-medium: pr daily_regression
Allow-unstable-test: true
Signed-off-by: Tom Nabarro <[email protected]>
@tanabarr added the bug, control-plane (work on the management infrastructure of the DAOS Control Plane) and go (Pull requests that update Go code) labels Feb 5, 2025
@tanabarr self-assigned this Feb 5, 2025
github-actions bot commented Feb 5, 2025

Ticket title is 'Mitigation against hugepage memory fragmentation'
Status is 'In Review'
https://daosio.atlassian.net/browse/DAOS-16979

@tanabarr (Contributor Author)

@phender as discussed, in order to try to reproduce the DMA grow failure (DAOS-16979) related to hugepage fragmentation, I ran this PR with Test-tag-hw-medium: pr daily_regression to try to trigger the failure. Unfortunately https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-15848/1/pipeline/ didn't hit the issue. Any other ideas on how to get a baseline to prove a fix? Or any other approaches, like just landing a fix and seeing if it has the desired result? I'm going to try to generate a local reproducer as well.

Test-tag: pr daily_regression
Allow-unstable-test: true
Signed-off-by: Tom Nabarro <[email protected]>
@tanabarr tanabarr marked this pull request as ready for review February 26, 2025 11:52
@tanabarr tanabarr requested review from a team as code owners February 26, 2025 11:52
@mjmac (Contributor) left a comment


If I understand the ticket and the PR changes correctly, this patch aims to reduce CI instability by reserving a hard-coded maximum amount of hugepages at startup, regardless of the number of bdevs in the configuration.

If that is correct, then I'm not sure I agree with the approach. Changing the product code to work around quirks of the CI testing process seems wrong to me. Instead, it seems like it would be best to ensure that the first configuration of the server has the maximum number of hugepages allocated for the rest of the run. This is a CI problem rather than a product problem, and therefore the solution should be implemented in the test harness, IMO.

	minHugepages, maxHugepages, cfgTargetCount, largeTargetCount, msgSysXS)

if minHugepages > maxHugepages {
	log.Debugf("config hugepage requirements exceed normal maximum")
Contributor:

Should this be logged at NOTICE level? Who is it for?

Contributor Author:

This is just a debug message because the user doesn't need to do anything about it; it indicates that the configuration requires more hugepages than the normal maximum.

	return FaultConfigHugepagesDisabledWithBdevs
}
if minHugepages != 0 {
	log.Noticef("hugepages disabled but targets will be assigned to bdevs, " +
Contributor:

I don't disagree with doing something here, but logging "caution is advised" is not particularly helpful, IMO. Is it an error or not? What is the admin supposed to do if/when they happen to notice this message in the server log?

Contributor Author:

This is to indicate that the server is operating in an unusual mode; the administrator should be aware of that.

@tanabarr (Contributor Author)

> If I understand the ticket and the PR changes correctly, this patch aims to reduce CI instability by reserving a hard-coded maximum amount of hugepages at startup, regardless of the number of bdevs in the configuration.
>
> If that is correct, then I'm not sure I agree with the approach. Changing the product code to work around quirks of the CI testing process seems wrong to me. Instead, it seems like it would be best to ensure that the first configuration of the server has the maximum number of hugepages allocated for the rest of the run. This is a CI problem rather than a product problem, and therefore the solution should be implemented in the test harness, IMO.

I don't think this issue is restricted just to CI; IIRC this has been seen outside of our test infrastructure. @NiuYawei requested this change, so maybe it's appropriate that he responds to your objection.

@daosbuild1 (Collaborator)

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/2/execution/node/1420/log

@daltonbohning (Contributor)

> If I understand the ticket and the PR changes correctly, this patch aims to reduce CI instability by reserving a hard-coded maximum amount of hugepages at startup, regardless of the number of bdevs in the configuration.
> If that is correct, then I'm not sure I agree with the approach. Changing the product code to work around quirks of the CI testing process seems wrong to me. Instead, it seems like it would be best to ensure that the first configuration of the server has the maximum number of hugepages allocated for the rest of the run. This is a CI problem rather than a product problem, and therefore the solution should be implemented in the test harness, IMO.
>
> I don't think this issue is restricted just to CI; IIRC this has been seen outside of our test infrastructure. @NiuYawei requested this change, so maybe it's appropriate that he responds to your objection.

FWIW I've seen similar on Aurora after a fresh reboot:
https://daosio.atlassian.net/browse/DAOS-16921?focusedCommentId=135440
And I've only seen that with master, not 2.6.

@daosbuild1 (Collaborator)

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/2/execution/node/1565/log

@daosbuild1 (Collaborator)

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/2/execution/node/1518/log

@mjmac (Contributor) commented Feb 27, 2025

> If I understand the ticket and the PR changes correctly, this patch aims to reduce CI instability by reserving a hard-coded maximum amount of hugepages at startup, regardless of the number of bdevs in the configuration.
> If that is correct, then I'm not sure I agree with the approach. Changing the product code to work around quirks of the CI testing process seems wrong to me. Instead, it seems like it would be best to ensure that the first configuration of the server has the maximum number of hugepages allocated for the rest of the run. This is a CI problem rather than a product problem, and therefore the solution should be implemented in the test harness, IMO.
>
> I don't think this issue is restricted just to CI; IIRC this has been seen outside of our test infrastructure. @NiuYawei requested this change, so maybe it's appropriate that he responds to your objection.
>
> FWIW I've seen similar on Aurora after a fresh reboot: https://daosio.atlassian.net/browse/DAOS-16921?focusedCommentId=135440 And I've only seen that with master, not 2.6.

I don't doubt that there is a problem... My concern is more that it seems like the actual problem is not yet understood, and the proposed approach in this PR is a potential solution for a very specific set of scenarios. Adding a hard-coded configuration for hugepages kind of defeats the purpose of having a configuration mechanism, and it seems likely to cause unintended problems for configurations that are outside of what's being hard-coded in this PR.

@NiuYawei (Contributor)

> If I understand the ticket and the PR changes correctly, this patch aims to reduce CI instability by reserving a hard-coded maximum amount of hugepages at startup, regardless of the number of bdevs in the configuration.
> If that is correct, then I'm not sure I agree with the approach. Changing the product code to work around quirks of the CI testing process seems wrong to me. Instead, it seems like it would be best to ensure that the first configuration of the server has the maximum number of hugepages allocated for the rest of the run. This is a CI problem rather than a product problem, and therefore the solution should be implemented in the test harness, IMO.
>
> I don't think this issue is restricted just to CI; IIRC this has been seen outside of our test infrastructure. @NiuYawei requested this change, so maybe it's appropriate that he responds to your objection.
>
> FWIW I've seen similar on Aurora after a fresh reboot: https://daosio.atlassian.net/browse/DAOS-16921?focusedCommentId=135440 And I've only seen that with master, not 2.6.
>
> I don't doubt that there is a problem... My concern is more that it seems like the actual problem is not yet understood, and the proposed approach in this PR is a potential solution for a very specific set of scenarios. Adding a hard-coded configuration for hugepages kind of defeats the purpose of having a configuration mechanism, and it seems likely to cause unintended problems for configurations that are outside of what's being hard-coded in this PR.

Yes, there could be other unknown issues to be solved (as @daltonbohning mentioned, the allocation failure was seen after a fresh reboot, when memory isn't supposed to be fragmented), but allocating hugepages at run time (setting nr_hugepages) is believed to be a likely source of fragmentation.

I think our goal is to avoid allocating hugepages at run time when possible, whether on a production or a testing system.
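
For reference, the kernel knob in question is vm.nr_hugepages, exposed at /proc/sys/vm/nr_hugepages. A hedged grow-only sketch (illustrative, not this PR's code; ensureHugepages is a hypothetical helper, and writing the file requires root):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

const nrHugepagesPath = "/proc/sys/vm/nr_hugepages"

// currentNrHugepages reads the kernel's current hugepage pool size.
func currentNrHugepages() (int, error) {
	b, err := os.ReadFile(nrHugepagesPath)
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(strings.TrimSpace(string(b)))
}

// ensureHugepages grows the pool to at least want pages but never shrinks
// it, so repeated server starts don't keep resizing the pool at run time.
func ensureHugepages(want int) error {
	have, err := currentNrHugepages()
	if err != nil {
		return err
	}
	if have >= want {
		return nil // pool already large enough: no kernel request made
	}
	return os.WriteFile(nrHugepagesPath, []byte(strconv.Itoa(want)), 0644)
}

func main() {
	if err := ensureHugepages(1024); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```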

Test-tag: pr daily_regression
Allow-unstable-test: true
Signed-off-by: Tom Nabarro <[email protected]>
	return nil
} else if minHugepages == 0 {
	// Enable minimum needed for scanning NVMe on host in discovery mode.
	if cfg.NrHugepages < scanMinHugepageCount && mi.HugepagesTotal < scanMinHugepageCount {
Contributor:

I am probably missing something, but I was thinking that mi.HugepagesTotal was the number of available hugepages, and thus cfg.NrHugepages could not be greater than this first value.

// allocate on numa node 0 (for example if a bigger number of hugepages are
// required in discovery mode for an unusually large number of SSDs).
prepReq.HugepageCount = srv.cfg.NrHugepages
srv.log.Debugf("skip allocating hugepages, no change is required")
Contributor:

Should it not be an error to have bdevs without hugepages?

Contributor Author:

After chatting with @NiuYawei, we decided that we should support emulated NVMe with or without hugepages, as some usage models may not require them.

@knard38 (Contributor) Feb 28, 2025

Makes sense.
Nit: the error FaultConfigHugepagesDisabledWithBdevs raised at line 568 should probably be renamed to something such as FaultConfigHugepagesDisabledWithNvmeBdevs.

@daosbuild1 (Collaborator)

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/3/execution/node/1220/log

@daosbuild1 (Collaborator)

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/3/execution/node/1429/log

@daosbuild1 (Collaborator)

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/3/execution/node/1522/log

@daosbuild1 (Collaborator)

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/3/execution/node/1569/log

Labels: bug, control-plane (work on the management infrastructure of the DAOS Control Plane), go (Pull requests that update Go code)

6 participants